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A  Parabolic  Theory  of  Load  Balance 


Alan  Heirich  h  Stephen  Taylor 
Scalable  Concurrent  Programming  Laboratory 
California  Institute  of  Technology 


Abstract 

We  derive  analytical  results  for  a  dynamic  load  balancing  algo¬ 
rithm  modeled  by  the  heat  equation  tt*  =  V2w.  The  model  is  appro¬ 
priate  for  quickly  diffusing  disturbances  in  a  local  region  of  a  compu¬ 
tational  domain  without  affecting  other  parts  of  the  domain.  The  al¬ 
gorithm  is  useful  for  problems  in  computational  fluid  dynamics  which 
involve  moving  boundaries  and  adaptive  grids  implemented  on  mesh- 
connected  multicomputers.  The  algorithm  preserves  task  locality  and 
uses  only  local  communication.  Resulting  load  distributions  approxi¬ 
mate  time  asymptotic  solutions  of  the  heat  equation.  As  a  consequence 
it  is  possible  to  predict  both  the  rate  of  convergence  and  the  quality 
of  the  final  load  distribution.  These  predictions  suggest  that  a  typical 
imbalance  on  a  multicomputer  with  over  a  million  processors  can  be 
reduced  by  one  order  of  magnitude  after  105  arithmetic  operations  at 
each  processor.  For  large  n  the  time  complexity  to  reduce  the  expected 
imbalance  is  effectively  independent  of  n. 


lrThe  research  described  in  this  report  is  sponsored  primarily  by  the  Defense  Advanced 
Research  Projects  Agency,  DARPA  Order  number  8176,  and  monitored  by  the  Office  of 
Naval  Research  under  contract  number  N00014-91-J-1986. 


1.  INTRODUCTION 


The  emergence  of  massively  parallel  computers  has  introduced  new  con¬ 
siderations  for  designers  of  concurrent  algorithms.  Two  primary  issues  when 
dealing  with  irregular  problems  are  load  balancing  and  scalability.  A  vari¬ 
ety  of  algorithms  for  dynamic  load  balancing  exist  for  systems  with  small 
numbers  of  computers  [2].  These  algorithms  are  elegant,  efficient  and  prov- 
ably  correct.  Unfortunately  they  require  that  each  processor  communicate 
information  about  it’s  workload  to  a  central  processor  which  performs  the 
computation.  One  of  our  research  goals  is  to  develop  algorithms  which  scale 
to  the  next  generation  of  fine-grain  mesh- connected  multicomputers  [10,  11] 
which  have  potentially  hundreds  of  thousands  of  processors.  Therefore  we 
seek  algorithms  which  emphasize  nearest  neighbor  communication  and  dis¬ 
tribute  the  computational  work  across  all  of  the  affected  processors. 

We  frequently  encounter  the  need  for  dynamic  load  balancing  in  our  work 
on  computational  fluid  dynamics.  This  occurs  in  finite  volume  or  finite  dif¬ 
ference  simulations  in  which  we  have  partitioned  the  computational  domain 
across  a  set  of  processors  on  a  multicomputer.  We  can  initially  distribute  the 
computational  grid  points  in  such  a  way  as  to  make  the  load  on  each  processor 
approximately  the  same.  The  load  remains  balanced  until  the  computational 
grid  is  adapted  to  follow  shock  waves  or  moving  boundaries.  These  adap¬ 
tations  create  grid  points  in  some  parts  of  the  domain  and  destroy  points 
in  other  parts.  As  a  result  the  loads  on  the  processors  become  unequal  and 
dynamic  rebalancing  is  necessary. 

This  paper  describes  an  efficient  algorithm  to  diffuse  local  load  imbal¬ 
ances  on  a  mesh-connected  multicomputer.  The  algorithm  is  parameterized 
and  can  achieve  any  desired  refinement  of  balance  to  the  limits  of  machine 
precision,  ft  uses  no  global  communication  although  it  does  require  that 
processors  at  exterior  points  communicate  with  their  counterparts  on  the 
opposite  sides  of  the  mesh.  (We  are  presently  considering  strategies  to  elim¬ 
inate  this  requirement.)  The  algorithm  can  be  used  to  rebalance  a  local 
portion  of  a  computational  domain  while  the  normal  computation  executes 
concurrently  over  the  rest  of  the  domain.  The  algorithm  can  be  implemented 
using  policies  to  minimize  the  distances  over  which  tasks  migrate  during  their 
lifetime.  This  is  a  desirable  property  for  many  simulation  problems  because 
it  minimizes  the  cost  of  communication  during  the  simulation. 

We  have  chosen  to  model  the  process  of  rebalancing  by  the  heat  equation 
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ut  —  V2u  on  a  domain  with  periodic  boundary  conditions.  Heat  diffusion 
is  an  apt  analogy  for  the  types  of  problems  that  we  encounter.  A  load 
imbalance  is  typically  caused  by  a  computational  grid  adapting  locally  and 
asynchronously.  This  is  analogous  to  creating  a  heat  source  or  sink  at  a  point 
inside  a  physical  volume.  Just  as  heat  diffuses  away  from  the  source  or  toward 
the  sink  in  the  physical  volume  we  want  work  to  diffuse  away  from  overloaded 
processors  or  toward  underutilized  processors  in  the  multicomputer.  The 
heat  equation  is  an  efficient  numerical  model  for  diffusing  point  sources  as 
it  exhibits  exponential  convergence  to  equilibrium  of  all  Fourier  modes  and 
fast  exponential  convergence  of  high  wavenumber  modes. 

Similar  informal  notions  have  appeared  in  the  literature  [3]  such  as  the 
Rediflow  algorithm  of  Keller  et  al  [8].  Although  such  approaches  are  intu¬ 
itively  persuasive  they  may  also  be  incorrect  and  lead  to  unbalanced  loads 
or  unstable  behavior.  Our  algorithm  is  the  first  “diffusive”  method  for  mesh 
architectures  which  is  scalable  and  has  rigorous  proofs  of  correctness  and  con¬ 
vergence.  It  is  also  the  first  effort  based  on  a  formal  model  of  the  rebalancing 
process  and  the  first  demonstration  that  established  numerical  methods  for 
grid  based  computations  can  lead  to  practical  algorithms  for  mesh-connected 
multicomputers.  We  are  currently  exploring  implementations  of  this  and 
other  models. 

The  first  analytical  work  on  diffusive  dynamic  load  balancing  is  due  to 
Cybenko  [4].  In  addition  to  an  optimal  diffusive  method  for  hypercube  ar¬ 
chitectures  he  presents  an  elegant  analysis  of  a  general  iterative  scheme  for 
arbitrary  interconnection  patterns.  His  scheme  distributes  the  computational 
work  across  all  of  the  affected  processors  but  does  not  restrict  itself  to  nearest 
neighbor  communication.  Hong,  Tan  and  Chen  [6]  appreciate  the  importance 
of  local  communication  and  derive  convergence  results  for  nearest  neighbor 
schemes  on  hypercubes.  Unfortunately  their  results  do  not  apply  to  mesh- 
connected  architectures  and  do  not  take  advantage  of  numerical  solution 
methods  for  differential  equations. 

2.  THE  DIFFUSIVE  HEAT  BALANCING  ALGORITHM 

We  interpret  the  workload  as  a  vector  u  which  is  distributed  across  the  proces¬ 
sors  of  a  three  dimensional  mesh- connected  multicomputer  of  n  processors. 
The  algorithm  simulates  the  passage  of  “artificial  time”  during  which  the 
workload  diffuses  according  to  ut  —  V2u.  After  this  time  has  elapsed  the 
maximum  imbalance  | uXty>z  —  u\  has  been  reduced  by  a  given  factor  a. 


2 


1.  Initialization: 

Choose  a  desired  reduction  a  in  load  imbalance.  Solve  the  following 
inequality  numerically  for  r  (see  table  I  for  example  solutions) 
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where  i,j,  k  are  indexed  from  0  to  (n1/3^  /2  —  1  and  the  case  i  —  j  = 
k  =  0  is  omitted.  Determine  a  parameter  v  related  to  accuracy  of  the 
resulting  solution. 

ln(a) 

"  =  Rife) 

2.  Rebalancing: 

At  every  processor  x,y,z  adjust  the  workload  uXtViZ 


(2) 


for  1=1  to  \t]  /*  outer  loop  */ 
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endf  or  /*  inner  loop  */ 

Exchange  (aujR  —  au'("^  units  of  work  with  every  neighbor  v! . 

it  ^  9#(0 

ux,y,z  ux,y,z 

endf  or  /*  outer  loop  */ 


DISCUSSION 

The  algorithm  consists  of  two  loops.  We  show  in  the  following  sections  of  this 
paper  that  these  loops  compute  an  approximate  solution  to  a  diffusive  process 
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r(a,  n)  n  (total  processors) 


512 

4,096 

32K 

256K 

106 

0.1 

8 

6 

6 

5 

5 

a  0.01 

297 

302 

245 

214 

204 

0.001 

6,605 

12,589 

13,795 

11,044 

9,895 

Table  1:  outer  loop  iterations  as  a  function  of  multicomputer  size  n  and 
quality  of  balance  a. 


described  by  the  heat  equation  ut  =  V2u.  The  outer  loop  is  responsible  for 
advancing  artificial  time  by  a  step  dt.  This  proceeds  until  a  time  has  elapsed 
that  is  sufficient  to  diffuse  the  load  imbalance.  The  inner  loop  is  responsible 
for  computing  the  new  value  of  the  load  at  a  specific  time  step  t  =  (l)dt. 
When  the  inner  loop  has  computed  the  new  value  an  exchange  operation 
causes  the  load  at  each  processor  to  equal  this  value. 

Each  processor  communicates  only  with  it’s  six  neighbors  in  the  three  di¬ 
mensional  mesh.  A  processor  which  is  at  an  exterior  mesh  boundary  may  lack 
as  many  as  three  of  these  neighbors  in  the  physical  mesh.  These  exterior  pro¬ 
cessors  communicate  with  “logical”  neighbors  at  exterior  points  on  opposite 
sides  of  the  mesh.  This  makes  the  mesh  logically  spherical  and  implements 
the  periodic  boundary  conditions  on  which  our  analysis  depends. 

The  cost  of  the  rebalancing  phase  is  the  sum  of  two  costs:  the  cost  to 
compute  the  new  load,  represented  by  the  inner  loop,  and  the  cost  to  ex¬ 
change  work  among  the  processors.  Each  inner  loop  iteration  (3)  requires 
seven  arithmetic  operations  and  six  bidirectional  exchanges  of  a  single  num¬ 
ber  at  each  processor.  The  total  number  of  iterations  is  the  product  tv.  The 
value  of  v  is  always  <  3  for  0  <  a  <  1.  As  table  I  shows  r  is  5  for  a  rebal¬ 
ancing  operation  involving  one  million  processors  when  a  is  0.1.  Assuming  a 
communication  rate  within  a  factor  of  ten  of  the  instruction  rate  the  cost  to 
reduce  the  expected  load  imbalance  on  a  million  processor  multicomputer  by 
a  factor  a  =  0.1  is  about  l0v(a)T(a,  n)  =  10(3)5  =  150  instruction  cycles. 
This  requires  less  than  4  /is  of  real  time  on  a  multicomputer  with  40  MHz 
processors  [11]. 

For  fluid  dynamics  simulations  we  would  like  to  preserve  task  locality 
while  we  redistribute  portions  of  the  computational  domain.  This  is  true  for 
many  simulation  problems  because  communication  rates  are  highest  between 
adjacent  portions  of  the  domain  and  as  a  consequence  communication  cost 
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is  inversely  related  to  locality.  In  this  algorithm  locality  can  be  preserved  by 
using  an  appropriate  exchange  policy.  One  such  policy  would  be  to  assume 
units  of  work  represent  portions  of  a  domain  which  has  been  decomposed 
contiguously  across  the  processors.  Under  this  assumption  each  neighboring 
processor  represents  a  neighboring  portion  of  the  domain.  Any  exchange 
policy  which  minimizes  average  pairwise  distance  between  units  of  work  on 
a  common  processor  maximizes  locality. 

We  would  also  like  to  rebalance  local  portions  of  a  domain  without  having 
to  interrupt  the  rest  of  the  computation.  The  algorithm  can  be  implemented 
in  this  way  simply  by  restricting  it  to  a  rectangular  subportion  of  the  mul¬ 
ticomputer  mesh.  If  the  algorithm  were  restricted  to  a  cube  then  in  the 
following  analysis  the  term  n  would  represent  the  number  of  processors  in 
the  cube.  If  the  restricted  subportion  were  not  a  cube  then  n  would  be  taken 
as  the  length  of  the  longest  side  raised  to  the  third  power.  This  would  ensure 
that  convergence  bounds  reflect  worst  case  estimates. 

The  reader  should  understand  that  the  parameter  a  in  the  following  anal¬ 
ysis  represents  a  reduction  in  the  expected  imbalance  rather  than  a  measure 
of  the  final  imbalance.  Actual  measurements  of  imbalance  require  global 
communication  which  is  contrary  to  our  stated  goals.  Various  implementa¬ 
tion  policies  can  be  formulated  to  achieve  guaranteed  measures  of  imbalance. 
For  example  if  the  size  of  the  initial  disturbance  is  known  then  a  can  be 
set  to  remove  this  disturbance.  In  this  paper  we  avoid  further  discussion  of 
the  numerous  options  which  exist  for  implementation  policies.  Our  goal  is 
to  establish  a  sound  theoretical  basis  on  which  to  implement  practical  al¬ 
gorithms.  We  are  presently  considering  implementation  policies  so  that  we 
can  incorporate  these  results  into  our  ongoing  work  in  computational  fluid 
dynamics. 

3.  DERIVATION  OF  THE  REBALANCING  PHASE 

In  this  section  we  demonstrate  the  relationship  between  our  algorithm 
and  the  process  of  diffusion  described  by  the  heat  equation  ut  =  V2u.  We 
first  derive  a  finite  difference  approximation  to  the  heat  equation.  This  rep¬ 
resentation  is  implicit  which  means  that  in  order  to  advance  the  process  of 
diffusion  it  is  necessary  to  invert  a  coefficient  matrix.  We  choose  to  invert 
the  matrix  by  Jacobi  iteration  and  thus  we  arrive  at  the  inner  loop  (3)  of  the 
algorithm. 
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THE  HEAT  EQUATION  AND  FINITE  DIFFERENCES 
Consider  the  parabolic  heat  equation  in  three  dimensions 

rq  —  V  XI  —  'M’xx  +  Uyy  T  ^ zz  ) 

Taylor  expanding  in  t  with  all  derivatives  evaluated  at  (x,y,z,t) 

u(x,y,  z,t  +  dt)  =  u(x,  y,  z,t)  +  utdt  +  0(dt2) 

u(x,  y,z,t  +  dt)  —  u(x ,  y,  z ,  t ) 
dt 

We  obtain  the  second  order  terms  by  expanding  in  spatial  variables  where 
omitted  coordinates  are  interpreted  as  ( x,y,z,t  +  dt) 


^  +  0  {dt) 


u(x  +  dx,  -,  •,  •)  =  u(x,  •,•,•)  +  uxdx  + 

dr 2  dr3 

Uxx^  +  UXXX^~  +  0(dx*] 

dx^ 

u(x  —  dx, -,-,■)  =  u(x,  ■,  •)  —  uxdx  +  uxx-— 


u7 


dx3 

:T" 


+  0{dxA) 


+  0  (dx2) 


u(x  +  dx, -,-,■)  + u(x  —  dx, -,-,■)  =  2u(x,  •,•,•)  +  uxxdx2  +  0(dx4) 

u(x  +  dx,  -,  •,  •)  +  u(x  —  dx,  •,  •,  •)  —  2 u(x,  ■,  ■,  •)' 

_ 

\  * 

Similar  expansions  in  y,  z  show  that  the  heat  equation  can  be  rewritten 

u(-,  ■,  ■ ,t  +  dt )  -  u(-,  •,  -,t)  _  f  u(x  +  dx,  ■,  ■,  •)  +  u(x  -  dx,  •,  •,  •)  -  2u(-,  •,  •,  •)' 

"  I  dx 2 


dt 


+ 


'«(•,  V  +  dy,  ■,  ■)  +  u(-,  y  -  dy,  ■,  ■)  -  2 u(-,  •,  •)' 


dy 2 


+ 


'u(-, -,z  +  dz,  •)  +  u(-,-,z  -  dz,-)  -  2u(-, -,-,-)' 

dz2 


+  0 (dt,  dx2,  dy2,  dz2) 


Taking  dx  —  dy  =  dz  we  can  express  u(x,y,  z,t)  in  terms  of  time  i  +  dt  and 
approximate  the  heat  equation  to  within  0 (dt,dx2) 

u(x,  y,  z,  t)  «  (l  +  6^)  u{-,  -,-,t  +  dt)  -  -£2  [u{x  +  dx,  -, -,  •)  +  u(x  -  dx,  -,  •) 
+«(•,  y  +  dy,  -,  •)  +  u(-,  y  -  dy,  •)  +  u(-,  -,z  +  dz,  ■)  +  u(-,  -,z-  dz,  •)] 
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Notice  that  if  we  let  a  —  ^  the  preceding  equation  can  be  rewritten 

Ux,y,z(t)  =  (1  T  6a)  UXty,z(t  -f-  dt)  a  \Ux+l,y,z(t  T  dt)  T  Ux—1 ,y,z{t  A  dt) 

+u®,j/+m(£  +  dt )  +  ux>y-itZ(t  +  d£)  +  +  dt)  +  uXjyjZ-i(t  +  dt)\ 

(5) 

SIMULATING  THE  PROCESS  OF  DIFFUSION 

We  can  consider  (5)  as  a  vector  equation  if  we  assign  coordinates  x.,y,z  to 
the  elements  of  a  vector  u  in  the  natural  way, 

u(t)  =  Au(t  +  dt)  (6) 

All  terms  in  u  on  the  right  hand  side  represent  values  at  time  t  A  dt.  A  finite 
difference  scheme  of  this  sort  is  said  to  be  “implicit”  because  values  at  time 
t  are  expressed  as  a  function  of  values  at  time  t  +  dt.  Implicit  schemes  are 
known  to  have  desirable  stability  properties.  In  particular  the  error  terms 
do  not  accumulate  in  successive  solutions  so  the  error  in  the  final  solution  is 
0(a).  In  order  to  compute  solutions  at  successive  time  intervals  dt  we  must 
invert  the  coefficient  matrix  A  to  solve 

u(t  +  dt)  =  A_1u(t)  (7) 

There  are  many  possible  ways  to  compute  A-1u(f).  In  this  paper  we  consider 
the  method  of  Jacobi  iteration  [5]  as  it  maintains  locality  of  communication 
and  therefore  provides  a  scalable  algorithm.  This  method  determines  u(t+dt) 
in  solving  the  system  Au(t  +  dt)  —  u(t)  with  u(t)  known. 

Jacobi  iteration  splits  a  coefficient  matrix  A  —  (D  —  T)  into  a  diagonal 
matrix  D  and  another  matrix  T  with  a  zero  diagonal.  With  a  modicum  of 
algebra  we  can  derive  an  iteration  x =  (D~1T)x^m~1^  +  D~xb  from  any 
problem  Ax  =  b.  This  iteration  is  known  to  converge  to  solutions  of  the 
original  equation  if  all  eigenvalues  of  (D~lT)  lie  within  the  unit  circle  in  the 
complex  plane.  A  Jacobi  iteration  is  easy  to  construct  from  a  given  coefficient 
matrix  A.  D~'  is  found  by  inverting  each  diagonal  term  d;  independently, 
while  T  is  just  the  negation  of  A  with  zero  diagonal  terms.  For  a  finite 
difference  matrix  like  A  in  (6)  the  resulting  (D_1T)  is  just  d_1T  where  d  = 
di  is  the  diagonal  term  which  appears  in  every  row  of  D.  Applying  this 
transformation  to  the  vector  equation  (6)  results  in  the  inner  iteration  (3). 
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4.  DERIVATION  OF  PARAMETERS 


We  have  shown  that  the  finite  difference  representation  of  the  heat  equa¬ 
tion  is  accurate  to  0(a)  and  we  have  chosen  a  to  control  the  quality  of  the 
balance  which  results  from  simulating  the  diffusion  process.  In  this  section 
we  determine  the  accuracy  of  the  converged  Jacobi  iteration  and  derive  a 
formula  for  v  which  guarantees  accuracy  a.  We  then  derive  a  formula  for  r 
which  determines  the  number  of  “artificial  time”  steps  over  which  the  diffu¬ 
sion  occurs. 

ACCURACY  OF  THE  JACOBI  ITERATION 

From  the  Gersgorin  disc  theorem  [7]  we  know  the  eigenvalues  A  of  (3)  are 
bounded  |A|  <  Since  the  row  and  column  sums  are  constant  and  the 

iteration  matrix  is  nonnegative  we  know  further  ([7],  theorem  8.1.22)  that 
the  spectral  radius  equals  the  row  sum 


P 


(d-'t) 


6a 

1  +  6a 


(8) 


Define  the  error  in  a  current  value  u F71)  under  the  iteration  (3)  as  e(v^)  — 
—  u*)  where  u*  is  the  fixed  point  of  the  Jacobi  iteration.  Then  for  v  >  0 

e(uM)  -  e((J9-1r)"u(0))  =  {D^Tf  e^)  (9) 


which  converges  when  p(D~xT )  <  1  since  p((D-1T)")  =  (p(D~1T))1' .  In 
order  for  the  algorithm  to  correctly  reduce  the  error  it  is  necessary  to  compute 
the  desired  load  at  each  time  step  to  an  appropriate  accuracy.  In  order  to 
quantify  the  error  define  the  infinity  norm 


max 

x,y,z 


=(“S«) 


=  max 

x,  y,z 


,(m 


—  U 


* 

x,y,z 


Using  this  norm  we  can  define  a  necessary  condition  for  improving  the  accu¬ 
racy  of  the  solution  u  by  a  factor  a  in  v  steps  to  be  ||e(u(^)||oo  <  a||e(77(0^)  ||oo- 
From  (9)  we  know  that  this  is  satisfied  when  ( p(D_1T ))"  <  a  and  thus  (2) 


ELAPSED  TIME  FOR  THE  DIFFUSION 

In  this  section  we  determine  the  number  of  artificial  time  steps  r  required 
to  reduce  the  load  imbalance  by  a  factor  a.  We  do  this  by  considering  the 
eigenstructure  of  the  finite  difference  equation  (5)  which  we  rearrange  to 
express  the  change  in  load  with  each  iteration 

^x,y,z{t  T  dt)  ®  [^a;+l ,y,z(t  ~t~  dt)  T  Ux—l,y,z(t  T  dt) 

T  dt)  T  U‘x,y—l,z(t  T  dt) 

T  dt)  T  Ux,y,z—\{t  T  dt) 

“6 Ux,y,z{t  T  dt)] 


or  as  a  vector  equation  with  matrix  operator  L 

d(t  +  dt)  —  u(t)  =  aLu{t  +  dt)  (11) 

Any  load  distribution  u(t)  can  be  written  as  a  weighted  superposition  of 
eigenvectors  x  of  L 

U{t)  =  /Lj  ai,j,k{t)xi,j,k 
i,j,k 

Using  this  fact  we  can  rewrite  the  vector  equation  (11)  as 

)  ]  ®j,j,fc(t  T  d^Xijk  ^  )  ®*"iii^(t)®*’iiifc  —  ®  ^  ]  Lctjj)fc(t  T  dtyxij^  (12) 

i,j,k  i,j,k  i,j,k 


Using  the  definition  of  L and  the  eigenvalues  of  L 

L  xi,j,k  =  ~\,j,kxi,j,k 

j 


Si,3,k 


3  —  cos  2-k 


n 1/3 


—  cos  2tt 


n 


1/3 


cos  2tt 


n1/3 , 


we  can  further  simplify  (12) 


y~)  +  dt)xijtk  [1  +  a-i,j,k{t)xi,j,k)  —  0 

ij,k 


(13) 


and  by  the  completeness  and  orthonormality  of  the  eigenvectors 

T  dt)  [1  T  oAtjjfc]  dj,_/,fc(/)  —  0 
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ai,j,k(dt) 


(14) 


ai,j,k(  0) 

1  4“ 

It  is  apparent  from  equation  (14)  that  convergence  of  the  individual  eigen- 
modes  is  strongly  dependent  upon  the  eigenvalues  Xi,j,k-  Reducing  the  am¬ 
plitude  of  an  arbitrary  component  i,j,k  by  a  in  r  steps  of  the  algorithm 
requires  [1  +  a\itjtk]~T  <  a.  The  worst  case  occurs  for  the  smallest  positive 
eigenvalue  Ao,o,i  =  (2  —  2  cos(27r/n1^3))  which  corresponds  to  a  smooth  sinu¬ 
soidal  disturbance  with  period  equal  to  the  length  of  the  computational  grid. 
To  reduce  such  a  disturbance  requires 


In 


a 


In 


1  +  a  (2  -  2  cos  $3 ) 


(15) 


Convergence  of  this  slowest  component  approaches  In  a  1  for  large  n  since 


lim  In 

71— KX) 


1  +  a  (  2  —  2  cos 


2w  \ 

n^J 


=  1 


Convergence  of  highest  wavenumber  component  A(ni/3)/2-i,(n1/3)/2-i,(n1/3)/2-i 
is  rapid  because 


In  a  1 

In  [1  +  (6  —  e)a] 


(16) 


These  characteristics  are  well  suited  to  our  work  in  adaptive  computational 
fluid  dynamics.  The  disruptions  which  occur  are  localized  and  can  be  treated 
as  a  series  of  disruptions  at  individual  processors.  For  these  reasons  we 
consider  the  time  to  diffuse  an  expected  case  in  which  in  a  local  area  of 
a  large  mesh  becomes  imbalanced.  We  consider  the  length  of  time  which 
must  pass  before  the  imbalance  is  reduced  by  a  factor  a.  Our  conclusion  is 
equation  (1). 

In  the  following  text  we  use  the  Poisson  bracket  (•,  •)  to  represent  the  inner 
product  operator.  When  discussing  loads  or  eigenvectors  we  use  u[x,y,z]  or 
%i,j,k[%,  y,  z]  to  denote  the  vector  element  which  corresponds  to  location  x,y,z 
of  the  computational  grid  with  the  convention  that  [0,  0, 0]  is  the  first  element 
of  the  vector.  Then  the  initial  disturbance  confined  to  a  particular  processor 
x,y,z  can  be  written  as  a  superposition  of  eigenvectors  of  L 


«(0)  —  'y  ^  ^^,m,n(0)^'i,m,n 

l,m,n 


(17) 
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If  we  assume  u(0)  to  be  zero  at  every  element  except  [x,  y,  z]  then 

(xi,j,k,  u( 0))  =  xitjtk[x,  y,  z]  (18) 

This  is  equal  to  the  initial  amplitude  ahJt/;(0)  of  each  eigenvector  Xijtk 

0))  = 


The  computational  domain  has  periodic  boundary  conditions  and  as  a  re¬ 
sult  the  origin  of  the  coordinate  system  is  arbitrary.  Then  without  loss  of 
generality  we  can  place  the  origin  at  the  source  of  the  disturbance  and  take 
x  =  y  =  z  =  0.  This  has  no  effect  on  the  eigenvectors  Xij  k  and  from  (18), 
(19) 

0, 0]  (20) 

Placing  the  origin  at  the  source  of  the  disturbance  is  particularly  convenient 
when  we  consider  the  first  element  of  the  eigenvectors  £;j,fc[0, 0, 0].  L  has 
/2  distinct  eigenvalues  each  of  algebraic  multiplicity  two.  Each 
of  these  eigenvalues  has  geometric  multiplicity  eight,  ie.  has  eight  linearly 
independent  associated  eigenvectors  of  unit  length 

xi,hk[x->y->z\  =  ci,j,kF\  f2  (2%—^  f3  (21) 

where  each  F{  is  either  sin  or  cos.  By  choosing  x  =  y  =  z  =  0  this  expression 

(21)  is  zero  except  for  the  single  eigenvector  for  which  Ti(x)  =  F2(x)  = 
F3(x)  =  cos(x).  As  a  result  without  loss  of  generality  we  can'  restrict  our 
consideration  to  initial  disturbances  of  the  form 

«[0, 0, 0](0)  =  I]  cyA-jbp,  0, 0]  =  y  c\hk  (22) 

i,j,k  i,j,k 


(  Xi,j,ki  £  (0  )xi  ,m,n  j 

\  l,m,n  t 

£  (Xi,j  (o) 

l,m,n 

kn 


ahj,k(  0) 


(19) 
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From  (14)  we  define  the  time  dependent  disturbance  at  any  location  x\  y',  z' 
U\x  >  y  ?  z  ](Tdt)  =  )  '  Ci,j,k  [1  T  xi,j,k[x  5  2/  >  z  ] 

i,j,k 

=  X/  Ci,j,k  "F  a^i,j,k] 

cos  (2,r  JF)  cos  ( [2n <23> 

In  the  appendix  we  show  that  =  (8/n)1/2  and  thus  the  disturbance  is  a 
summation  of  equally  weighted  eigenvectors.  ^Frorn  (22)  and  (23)  the  time 
dependent  disturbance  at  0, 0, 0  is  therefore 


“[0,0,0]  (rift)  = 

n  ij,k 

(24) 

Solving  u[0, 0, 0](rdt)  <  a  yields  equation  (1). 

5.  ALGORITHM  FOR  A  TWO-DIMENSIONAL  MESH 

The  algorithm  for  the  two  dimensional  case  is  very  similar  to  the  three 
dimensional  case.  The  inequality  for  r  becomes 


1  +  a2  3  —  cos 


27 xi 
1/3 


—  cos 


27 rj 


n 


n 


1/3 


cos 


27 vk 

TJz 


n 


n 


£ 


„  ,  ^  27T7  27T7  Y  ' 

1  +  a:2  (  2  —  cos  — jyj  —  cos 


The  new  spectral  radius  of  the  iteration  matrix  (3)  chang 


<  a 

es  v 


ln(a) 

(i4fe) 


and  similarly  the  iteration  itself 


u 


(m)  _ 

xtV 


a 


1+4  a 


(™— i) 
x+l  ,y 


+  u 


(m-l) 

x-l,y 


+  U 


(m-i) 
x,y+ 1 


+  u 


(m-l) 

x,y-l 


Up 

1+4  a 
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6.  CONCLUSIONS 


We  have  demonstrated  that  by  starting  with  a  model  of  the  rebalancing 
problem  and  applying  established  numerical  methods  to  that  model  we  arrive 
at  a  scalable  and  concurrent  load  balancing  algorithm  for  a  mesh-connected 
multicomputer.  The  essential  features  of  this  algorithm  are  that  it  diffuses 
local  disturbances  rapidly,  it  can  preserve  task  locality,  and  it  can  rebalance 
local  portions  of  a  domain  without  interrupting  the  rest  of  the  computation. 

We  do  not  claim  that  this  algorithm  is  optimal  and  we  identify  a  worst 
case  in  which  convergence  could  be  very  slow  for  large  n.  One  strategy  to 
remove  this  worst  case  behavior  would  be  to  embed  this  algorithm  within  a 
multigrid  computation  [9]  in  which  balances  are  computed  between  coarser 
subdivisions  of  the  architectural  mesh.  We  are  continuing  the  theoretical 
work  on  this  problem  by  considering  these  sorts  of  alternative  numerical 
methods  for  the  heat  equation  model  as  well  as  alternative  models  for  load 
balancing.  We  are  continuing  the  practical  work  by  implementing  this  al¬ 
gorithm  in  an  adaptive  finite  volume  flow  solver  for  irregular  problems  with 
moving  boundary  conditions.  We  expect  this  experience  to  give  us  insight 
into  the  policy  decisions  necessary  to  develop  a  practical  implementation. 
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APPENDIX:  NORMALIZATION  CONSTANT 

In  this  appendix  we  demonstrate  that  the  eigenvector  normalization  con¬ 
stant  Cij'k  is  equal  to  (8 /n)1/2  for  all  eigenvectors  xhJ^-  ^From  (21)  a  neces¬ 
sary  condition  for  a  normalized  eigenvector  is 


=  4E»’  (2^)  “s2  (2l  Js)  “s2  (2* 

=  ch,k\  £  (x  + cos47r^)  (x  + cos47r 


zk 

173 


n 


x,y,z 


yj 

n1/3 
zk 


1  +  cos  47T- 


zk 


n 


1/3 


=  C-’fc8^{(1  +  COs47r^)(1  +  COs47r 
+ x;  cos  (4^^)  £  c°s  (4,r^)  i1 + cos  4w 


zk  ' 

NTs 


n 


(25) 


We  can  simplify  the  preceding  expression  if  we  make  use  of  the  following 

Lemma  1 

xi 


COS  (  47T 


n1/3 


=  0 


Proof: 


. — ,  ,42E5i 

=  Re  e  11 

X 

= 


47Tt  \  ^ 


=  Re 


=  0 


e  n- 


e  nx 


47ft  / 

373  1 1 


e  n1 


,  4ir» 

1  -  eV'3 


Q.E.R. 
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Repeated  application  of  lemma  (1)  to  equation  (25)  yields 


1  =  ^4s(i+cos4^)(i+cos4t 


zk  ' 
17s 


n 


c2  - 

t,j,k  g 


r2  : 


£(1  +  C“4’r^)+?.(“S4’rn‘/3 

£i  +  e(' 


x,y,z 


COS  4-7T- 


Lx,y,z  x,y,z  \ 


CiJ,kgn 


1  +  cos  47 r 


n1/3 


(26) 


From  which  we  conclude 


ci,j,k  — 


V 
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